Author: Richard Barker
Date: 2018-08-18
The analyses reported in this document are part of the GeneLab Ecotypes in space project. The aim is to find features that are differentially expressed between Col, Cvi, Ler and WS2. The statistical analysis process includes data normalization, graphical exploration of raw and normalized data, a test for differential expression of each feature between the conditions, adjustment of the raw p-values, and export of lists of features with significant differential expression between the conditions.
The analysis is performed using the R software [R Core Team, 2014], Bioconductor [Gentleman, 2004] packages including edgeR [Robinson, 2010] and the SARTools package developed at PF2 - Institut Pasteur. Normalization and differential analysis are carried out according to the edgeR model and package. This report comes with additional tab-delimited text files that contain lists of differentially expressed features.
For more details about the edgeR methodology, please refer to its related publications [Robinson, 2007, 2008, 2010 and McCarthy, 2012].
The count data files and associated biological conditions are listed in the following table.
| SampleLabel | File | Treatment | Genotype |
|---|---|---|---|
| Col_SpaceFlight_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Col |
| Col_SpaceFlight_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Col |
| Col_SpaceFlight_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Col |
| Col_SpaceFlight_4 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Col |
| Col_Ground_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Col |
| Col_Ground_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Col |
| Col_Ground_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Col |
| Col_Ground_4 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Col |
| Cvi_SpaceFlight_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Cvi |
| Cvi_SpaceFlight_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Cvi |
| Cvi_SpaceFlight_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Cvi |
| Cvi_Ground_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Cvi |
| Cvi_Ground_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Cvi |
| Cvi_Ground_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Cvi |
| Ler_SpaceFlight_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Ler |
| Ler_SpaceFlight_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Ler |
| Ler_SpaceFlight_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | Ler |
| Ler_Ground_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Ler |
| Ler_Ground_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Ler |
| Ler_Ground_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | Ler |
| WS2_SpaceFlight_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | WS2 |
| WS2_SpaceFlight_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | WS2 |
| WS2_SpaceFlight_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | WS2 |
| WS2_SpaceFlight_4 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | SpaceFlight | WS2 |
| WS2_Ground_1 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | WS2 |
| WS2_Ground_2 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | WS2 |
| WS2_Ground_3 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | WS2 |
| WS2_Ground_4 | DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt | Ground | WS2 |
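The sample labels in the table above follow a `Genotype_Treatment_Replicate` pattern, so the design can be rebuilt and checked from the labels alone. The following is a minimal Python sketch of that check, independent of the R/SARTools pipeline used for the report; the names `REPLICATES`, `build_labels` and `parse_label` are illustrative and do not come from the original analysis.

```python
from collections import Counter

# Replicates per (genotype, treatment), as listed in the sample table.
REPLICATES = {
    ("Col", "SpaceFlight"): 4, ("Col", "Ground"): 4,
    ("Cvi", "SpaceFlight"): 3, ("Cvi", "Ground"): 3,
    ("Ler", "SpaceFlight"): 3, ("Ler", "Ground"): 3,
    ("WS2", "SpaceFlight"): 4, ("WS2", "Ground"): 4,
}


def build_labels(replicates):
    """Rebuild sample labels of the form Genotype_Treatment_Replicate."""
    return [
        f"{genotype}_{treatment}_{i}"
        for (genotype, treatment), n in replicates.items()
        for i in range(1, n + 1)
    ]


def parse_label(label):
    """Split a sample label back into its design factors."""
    genotype, treatment, replicate = label.split("_")
    return {
        "SampleLabel": label,
        "Genotype": genotype,
        "Treatment": treatment,
        "Replicate": int(replicate),
    }


labels = build_labels(REPLICATES)
design = [parse_label(label) for label in labels]
per_condition = Counter((d["Genotype"], d["Treatment"]) for d in design)

print(len(design), "samples")
for condition, n in per_condition.items():
    print(condition, n)
```

The design is unbalanced: Col and WS2 have four replicates per treatment, while Cvi and Ler have three.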
After loading the data, we first look at the raw data table itself. The table contains one row per annotated feature and one column per sequenced sample. Row names are feature IDs (unique identifiers), and the values are raw counts, i.e. the number of reads that map onto each feature. For this project, there are 41671 features in the count data table.
| Feature ID | Col_SpaceFlight_1 | Col_SpaceFlight_2 | Col_SpaceFlight_3 | Col_SpaceFlight_4 | Col_Ground_1 | Col_Ground_2 | Col_Ground_3 | Col_Ground_4 | Cvi_SpaceFlight_1 | Cvi_SpaceFlight_2 | Cvi_SpaceFlight_3 | Cvi_Ground_1 | Cvi_Ground_2 | Cvi_Ground_3 | Ler_SpaceFlight_1 | Ler_SpaceFlight_2 | Ler_SpaceFlight_3 | Ler_Ground_1 | Ler_Ground_2 | Ler_Ground_3 | WS2_SpaceFlight_1 | WS2_SpaceFlight_2 | WS2_SpaceFlight_3 | WS2_SpaceFlight_4 | WS2_Ground_1 | WS2_Ground_2 | WS2_Ground_3 | WS2_Ground_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AT1G01010.1 | 297 | 411 | 248 | 258 | 303 | 285 | 131 | 332 | 172 | 172 | 274 | 551 | 271 | 154 | 392 | 161 | 254 | 342 | 96 | 25 | 162 | 49 | 85 | 121 | 271 | 93 | 44 | 32 |
| AT1G01020.1 | 23 | 50 | 34 | 47 | 84 | 40 | 97 | 70 | 24 | 29 | 75 | 113 | 67 | 32 | 74 | 47 | 90 | 87 | 27 | 6 | 52 | 5 | 14 | 10 | 26 | 9 | 6 | 5 |
| AT1G01020.2 | 22 | 44 | 30 | 44 | 83 | 32 | 93 | 60 | 20 | 27 | 61 | 108 | 65 | 33 | 71 | 46 | 80 | 81 | 21 | 6 | 42 | 4 | 13 | 8 | 22 | 9 | 6 | 4 |
| AT1G01030.1 | 18 | 42 | 22 | 23 | 79 | 40 | 30 | 36 | 42 | 29 | 41 | 11 | 12 | 7 | 30 | 29 | 72 | 17 | 26 | 2 | 56 | 6 | 29 | 32 | 49 | 14 | 12 | 14 |
| AT1G01040.1 | 262 | 303 | 229 | 225 | 477 | 324 | 345 | 517 | 279 | 323 | 659 | 850 | 411 | 269 | 545 | 225 | 402 | 414 | 242 | 20 | 370 | 70 | 200 | 138 | 500 | 149 | 120 | 64 |
| AT1G01040.2 | 236 | 281 | 213 | 204 | 447 | 303 | 329 | 475 | 261 | 309 | 612 | 801 | 374 | 254 | 512 | 214 | 360 | 367 | 226 | 18 | 342 | 67 | 188 | 129 | 470 | 128 | 117 | 59 |
Looking at the summary of the count table provides a basic description of these raw counts (min and max values, median, etc).
| Statistic | Col_SpaceFlight_1 | Col_SpaceFlight_2 | Col_SpaceFlight_3 | Col_SpaceFlight_4 | Col_Ground_1 | Col_Ground_2 | Col_Ground_3 | Col_Ground_4 | Cvi_SpaceFlight_1 | Cvi_SpaceFlight_2 | Cvi_SpaceFlight_3 | Cvi_Ground_1 | Cvi_Ground_2 | Cvi_Ground_3 | Ler_SpaceFlight_1 | Ler_SpaceFlight_2 | Ler_SpaceFlight_3 | Ler_Ground_1 | Ler_Ground_2 | Ler_Ground_3 | WS2_SpaceFlight_1 | WS2_SpaceFlight_2 | WS2_SpaceFlight_3 | WS2_SpaceFlight_4 | WS2_Ground_1 | WS2_Ground_2 | WS2_Ground_3 | WS2_Ground_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1st Qu. | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Median | 22 | 28 | 23 | 26 | 34 | 28 | 20 | 38 | 19 | 21 | 37 | 48 | 26 | 15 | 46 | 16 | 31 | 34 | 14 | 2 | 23 | 5 | 11 | 10 | 27 | 9 | 5 | 4 |
| Mean | 285 | 317 | 283 | 293 | 379 | 319 | 262 | 449 | 231 | 278 | 399 | 495 | 262 | 177 | 479 | 213 | 366 | 372 | 189 | 48 | 311 | 121 | 212 | 139 | 362 | 152 | 66 | 50 |
| 3rd Qu. | 137 | 170 | 130 | 142 | 195 | 157 | 112 | 213 | 107 | 115 | 193 | 266 | 148 | 88 | 231 | 92 | 175 | 185 | 81 | 14 | 129 | 31 | 69 | 57 | 151 | 50 | 32 | 22 |
| Max. | 1332155 | 949659 | 1684886 | 2117410 | 2003292 | 1946182 | 1419028 | 2259068 | 1054556 | 1717036 | 1568931 | 1565851 | 905457 | 412651 | 1291992 | 831156 | 1215163 | 1677741 | 970559 | 599191 | 1722356 | 1405776 | 2071750 | 507645 | 1742480 | 1936451 | 312860 | 317225 |
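To make the rows of this summary concrete, the sketch below computes the same six statistics in Python for three samples, using only the six features shown in the raw count excerpt. It is an illustration rather than the code behind the report, so its output will not match the table above, which summarizes all 41671 features and reports rounded values. The `inclusive` quantile method interpolates linearly between order statistics, which is the convention R's default quantile type also follows; the function name `summarize` is an assumption.

```python
import statistics

# Counts for three samples, copied from the six features in the excerpt.
excerpt = {
    "Col_SpaceFlight_1": [297, 23, 22, 18, 262, 236],
    "Col_SpaceFlight_2": [411, 50, 44, 42, 303, 281],
    "Col_SpaceFlight_3": [248, 34, 30, 22, 229, 213],
}


def summarize(counts):
    """Six-number summary of one sample's counts (unrounded)."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "Min.": min(counts),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(counts),
        "3rd Qu.": q3,
        "Max.": max(counts),
    }


summaries = {sample: summarize(counts) for sample, counts in excerpt.items()}
for sample, summary in summaries.items():
    print(sample, summary)
```

In the full table, the mean is far above the median in every sample, which reflects the strong right skew typical of RNA-Seq counts: most features have low counts and a few have very high ones.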
Figure 1 shows the total number of mapped reads for each sample. Reads that map to multiple locations on the transcriptome are counted more than once, as long as they map to fewer than 50 different loci. We expect total read counts to be similar within conditions, although they may differ across conditions. Total counts sometimes vary widely between replicates, which may happen for several reasons.
Figure 2 shows the proportion of features with no read count in each sample. We expect this proportion to be similar within conditions. Features with zero read counts in all 28 samples are not taken into account in the edgeR analysis. Here, 1717 features (4.12%) are in this situation (dashed line).
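The two quantities behind this figure can be sketched in a few lines of Python: the percentage of zero-count features per sample, and the number of features that are zero in every sample. The toy table and the function names below are hypothetical and serve only to illustrate the computation; the last lines check the percentage reported above from the two figures given in the text.

```python
def count_null_features(count_rows):
    """Number of features whose count is zero in every sample."""
    return sum(1 for counts in count_rows.values() if not any(counts))


def zero_percentage_per_sample(count_rows):
    """Percentage of features with a zero count, one value per sample."""
    n_features = len(count_rows)
    n_samples = len(next(iter(count_rows.values())))
    return [
        100 * sum(1 for counts in count_rows.values() if counts[j] == 0) / n_features
        for j in range(n_samples)
    ]


# Hypothetical toy table: 4 features x 3 samples (not taken from the report).
toy = {
    "f1": [0, 0, 0],
    "f2": [5, 0, 2],
    "f3": [0, 0, 0],
    "f4": [1, 3, 0],
}

print(count_null_features(toy))
print(zero_percentage_per_sample(toy))

# Consistency check on the figures quoted in the text.
reported_pct = round(100 * 1717 / 41671, 2)
print(reported_pct)
```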
Figure 3 shows the distribution of read counts for each sample. For the sake of readability, \(\text{log}_2(\text{counts}+1)\) values are used instead of raw counts. Again, we expect replicates to have similar distributions. In addition, this figure shows whether read counts are predominantly low, medium or high. This depends on the organism as well as the biological conditions under consideration.
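The transformation used for this figure is simple to reproduce. The Python sketch below (the function name `log2_counts` is illustrative) shows why 1 is added before taking the logarithm: a zero count maps to 0 instead of being undefined.

```python
import math


def log2_counts(counts):
    """Apply log2(count + 1) to each raw count."""
    return [math.log2(c + 1) for c in counts]


# Raw counts of Col_SpaceFlight_1 for the six features in the excerpt.
raw = [297, 23, 22, 18, 262, 236]
transformed = log2_counts(raw)
print([round(value, 2) for value in transformed])

# Zero counts stay finite, and each doubling of (count + 1) adds one unit.
print(log2_counts([0, 1, 3]))
```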
It may happen that one or a few features capture a high proportion of reads (up to 20% or more). This phenomenon should not influence the normalization process: the edgeR normalization has been shown to be robust to this situation [Dillies, 2012]. In any case, we expect these high-count features to be the same across replicates, though not necessarily across conditions. Figure 4 illustrates the possible presence of such high-count features in the data set.
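As a rough illustration of the check described here, the Python sketch below finds, for one sample, the feature with the highest count and the percentage of that sample's reads it captures. The feature names and counts are hypothetical and are not taken from the data set; `top_feature_share` is an illustrative name.

```python
def top_feature_share(counts_by_feature):
    """Return (feature ID, percentage of total reads) for the top feature."""
    total = sum(counts_by_feature.values())
    top_id = max(counts_by_feature, key=counts_by_feature.get)
    return top_id, 100 * counts_by_feature[top_id] / total


# Hypothetical counts for a single sample (illustration only).
sample_counts = {"geneA": 800, "geneB": 150, "geneC": 50}

top_id, share = top_feature_share(sample_counts)
print(f"{top_id} captures {share:.1f}% of reads")
```

Running this per sample and comparing the returned feature IDs is one way to verify that replicates share the same dominant feature.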
We may wish to assess the similarity between samples across conditions. A pairwise scatter plot (Figure 5) shows how similar or different replicates and samples from different biological conditions are; \(\text{log}_2(\text{counts}+1)\) values are used instead of raw counts. Moreover, because the Pearson correlation has been shown not to be a relevant measure of similarity between replicates, the SERE statistic has been proposed as a similarity index between RNA-Seq samples [Schulze, 2012]. It measures whether the variability between samples is random Poisson variability or higher. Pairwise SERE values are printed in the lower triangle of the pairwise scatter plot. The value of the SERE statistic is: